{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "view-in-github"
   },
   "source": [
    "<a href=\"https://colab.research.google.com/github/jermwatt/machine_learning_refined/blob/main/notes/6_Linear_twoclass_classification/6_6_Categorical.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "30MWxqYSoq41"
   },
   "source": [
    "## Chapter 6: Linear two-class classification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook contains interactive content from an early draft of the university textbook <a href=\"https://github.com/neonwatty/machine-learning-refined/tree/main\">\n",
    "Machine Learning Refined (2nd edition) </a>.\n",
    "\n",
    "The final draft significantly expands on this content and is available for <a href=\"https://github.com/neonwatty/machine-learning-refined/tree/main/chapter_pdfs\"> download as a PDF here</a>."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Iic62iv6oq42"
   },
   "source": [
    "# 6.6 The Categorical Cross Entropy Cost"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "gmiLFUgfoq43"
   },
   "source": [
    "In Sections [6.2](https://jermwatt.github.io/machine_learning_refined/notes/6_Linear_twoclass_classification/6_2_Cross_entropy.html) and [6.3](https://jermwatt.github.io/machine_learning_refined/notes/6_Linear_twoclass_classification/6_3_Softmax.html) we saw how two different choices for label values, $y_p \\in \\left\\{0,1\\right\\}$ or $y_p \\in \\left\\{-1,+1\\right\\}$, result in precisely the same two class classification via Cross Entropy / Softmax cost function minimization.  In each instance we formed a *Log Error* cost per datapoint, and averaging these over all $P$ datapoints provided a proper and convex cost function.  In other words, the numerical values of the label pairs themselves were largely used just to *simplify* the expression of these cost functions.  Given the pattern for deriving convex cost functions for logistic regression, which we have now seen *twice* (once in each of the previous two Sections),  given any two *numerical* label values $y_p \\in \\left\\{a,b\\right\\}$ it would be a straightforward affair to derive an appropriate convex cost function based on a Log Error like pointwise cost.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "sR64KOKfoq43"
   },
   "source": [
    "However the true range of label value choices is even broader than this - than two *numerical* values.  We can really use *any two distinct objects as labels* as well, i.e., two unordered values.  However regardless of how we define our labels we still end up building the same sort two class classifier we have seen previously - tuning its weights by minimization of a familiar cost function like e.g., the Softmax / Cross Entropy cost.\n",
    " \n",
    "To drive home this point, in this brief Section we show how to derive the same Cross Entropy cost function seen in [Section 6.2](https://jermwatt.github.io/machine_learning_refined/notes/6_Linear_twoclass_classification/6_2_Cross_entropy.html) employing *categorical labels* instead of numerical ones.  This leads to the derivation of the so-called *Categorical Cross Entropy cost* function which - as we will see - is equivalent to the Softmax / Cross Entropy cost functions we have seen previously."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ZyzmLnU8oq44"
   },
   "source": [
    "##  One-hot encoded categorical labels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "BcqUMDmwoq44"
   },
   "source": [
    "Suppose we begin with a two class classification dataset $\\left\\{ \\left(\\mathbf{x}_{p},y_{p}\\right)\\right\\} _{p=1}^{P}$ with $N$ dimenisional input and transform our numerical label values $y_p \\in \\left\\{0,1\\right\\}$ with *one-hot encoded vectors* of the form \n",
    "\n",
    "\\begin{equation}\n",
    "y_p  = 0 \\longleftarrow  \\mathbf{y}_p = \\begin{bmatrix} 1 \\\\ 0   \\end{bmatrix} \\,\\,\\,\\,\\,\\,\\,\n",
    "y_p  = 1 \\longleftarrow  \\mathbf{y}_p = \\begin{bmatrix} 0 \\\\ 1   \\end{bmatrix}. \n",
    "\\end{equation}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "N3IG0Agsoq44"
   },
   "source": [
    "Each vector representation uniquely identifies its corresponding label value, but now our label values are no longer *ordered numerical values*, and our dataset now takes the form $\\left\\{ \\left(\\mathbf{x}_{p},\\mathbf{y}_{p}\\right)\\right\\} _{p=1}^{P}$ where $\\mathbf{y}_p$ is defined as above.  However our goal here will remain the same: to properly tune a set of $N+1$ weights $\\mathbf{w}$ regress the input to the output of our dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "LwO9cSkAoq44"
   },
   "source": [
    "##  Choosing a nonlinearity"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "T3YvKzw5oq45"
   },
   "source": [
    "With these new *categorical labels* our classification task - when viewed as a *regression problem* - is a special case of *multioutput regression* (as detailed in [Section 5.6](https://jermwatt.github.io/machine_learning_refined/notes/5_Linear_regression/5_6_Multi.html)) where we aim to regress $N$ dimensional input against two dimensional categorical labels using a nonlinear function the linear combination $\\mathring{\\mathbf{x}}_{p}^T\\mathbf{w}^{\\,}$.  Because our categorical labels have length two we need to use a nonlinear function of this linear combination that produces *two* outputs as well.  Since the labels are one-hot encoded and we are familiar with the sigmoid function, it then makes sense to sense use the following nonlinear function of each input point $\\mathbf{x}_p$\n",
    "\n",
    "\\begin{equation}\n",
    "\\overset{\\,}{\\boldsymbol{\\sigma}}_p = \\begin{bmatrix}  \\sigma\\left(\\mathring{\\mathbf{x}}_{p}^T\\mathbf{w}^{\\,}\\right) \\\\ 1 - \\sigma\\left(\\mathring{\\mathbf{x}}_{p}^T\\mathbf{w}^{\\,}\\right) \\end{bmatrix}.\n",
    "\\end{equation}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "uoUS-Zsnoq45"
   },
   "source": [
    "Why? Because suppose for a particular point that $\\mathbf{y}_p  = \\begin{bmatrix} 1 \\\\ 0 \\end{bmatrix}$ and $\\mathbf{w}$ is tuned so that $\\sigma\\left(\\mathring{\\mathbf{x}}_{p}^T\\mathbf{w}^{\\,}\\right) \\approx 1$.  By the definition of the sigmoid this implies that $1 - \\sigma\\left(\\mathring{\\mathbf{x}}_{p}^T\\mathbf{w}^{\\,}\\right) \\approx 0$ and so that - for this point - $\\overset{\\,}{\\boldsymbol{\\sigma}}_p^{\\,} \\approx  \\begin{bmatrix} 1 \\\\ 0 \\end{bmatrix} = \\mathbf{y}_p$ which is indeed our desire.  And - of course - this same idea holds if $\\mathbf{y}_p  = \\begin{bmatrix} 0 \\\\ 1 \\end{bmatrix}$ as well."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ndw2Uv37oq45"
   },
   "source": [
    "Thus with this nonlinear transformation an ideal setting of our weights $\\mathbf{w}$ will force \n",
    "\n",
    "\\begin{equation}\n",
    "\\overset{\\,}{\\boldsymbol{\\sigma}}_p^{\\,} \\approx \\mathbf{y}_p\n",
    "\\end{equation}\n",
    "\n",
    "to hold for as many points as possible."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "HfrDiKNOoq46"
   },
   "source": [
    "##  Choosing a cost function"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "oOFW2z85oq46"
   },
   "source": [
    "As was the case with numerical labels, here we could also very well propose a standard pointwise cost taken from our experience with regression like e.g., the Least Squares\n",
    "\n",
    "\\begin{equation}\n",
    "g_p\\left(\\mathbf{w}\\right) = \\left\\Vert\\overset{\\,}{\\boldsymbol{\\sigma}}_p^{\\,} - \\mathbf{y}_p\\right\\Vert_2^2\n",
    "\\end{equation}\n",
    "\n",
    "and minimize the average of these over all $P$ points to tune $\\mathbf{w}$.  However as was the case with numerical labels, here because our *categorical labels* take a very precise binary form we are better off employing a *Log Error* to better incentivize learning.   Denoting $\\,\\text{log}\\,\\boldsymbol{\\sigma}_p^{\\,}$ the vector formed by taking the $\\text{log}\\left(\\cdot\\right)$ of each entry of $\\boldsymbol{\\sigma}_p^{\\,}$, here the Log Error takes the form"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "D23sgX-Doq46"
   },
   "source": [
    "\\begin{equation}\n",
    "g_p\\left(\\mathbf{w}\\right) = -\\mathbf{y}_p^T\\,\\text{log}\\,\\boldsymbol{\\sigma}_p^{\\,} = - y_{p,1}\\,\\text{log}\\left(\\sigma\\left(\\mathring{\\mathbf{x}}_{p}^T\\mathbf{w}^{\\,}\\right)\\right) - y_{p,2}\\,\\text{log}\\left(1 - \\sigma\\left( \\mathring{\\mathbf{x}}_{p}^T\\mathbf{w}^{\\,} \\right) \\right)\n",
    "\\end{equation}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "4NfR-S8Toq46"
   },
   "source": [
    "Taking the average of these $P$ pointwise costs gives the so-called Categorical Cross Entropy cost function.  Here the 'categorical' part of this name refers to the fact that our labels are one-hot encoded categorical (i.e., unordered) vectors.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "rJXw1q5ooq46"
   },
   "source": [
    "However it is easy to see that this form is *precisely the Log Error we found in  [Section 6.2](https://jermwatt.github.io/machine_learning_refined/notes/6_Linear_twoclass_classification/6_2_Cross_entropy.html)*.  In other words, written in terms of our original *numerical* labels, the pointwise cost above is precisely"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "xuemBVgWoq46"
   },
   "source": [
    "\\begin{equation}\n",
    "g_p\\left(\\mathbf{w}\\right)=\n",
    "-y_p\\,\\text{log}\\left(\\sigma\\left(\\mathring{\\mathbf{x}}_{p}^T\\mathbf{w}^{\\,}\\right) \\right) \n",
    "-\\left(1-y_p\\right)\\text{log}\\left(1 -\\sigma\\left(\\mathring{\\mathbf{x}}_{p}^T\\mathbf{w}^{\\,}\\right) \\right).\n",
    "\\end{equation}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "JwdDTdvBoq47"
   },
   "source": [
    "Therefore - even though we employed categorical versions of our original numerical label values - the cost function we minimize in the end to properly tune $\\mathbf{w}$ is precisely the same we have seen previously - the Cross Entropy / Softmax cost where numerical labels were employed.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "F7OCO1Bwoq47"
   },
   "source": [
    "Finally, note that this equivalency holds *regardless of whether we begain with numerical values and translated them into one-hot vectors*.  We could have begun with a dataset whose labels *originally came* in the form of one-hot vectors, and could have concluded precisely what we saw above. "
   ]
  }
 ],
 "metadata": {
  "colab": {
   "include_colab_link": true,
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  },
  "toc": {
   "colors": {
    "hover_highlight": "#DAA520",
    "navigate_num": "#000000",
    "navigate_text": "#333333",
    "running_highlight": "#FF0000",
    "selected_highlight": "#FFD700",
    "sidebar_border": "#EEEEEE",
    "wrapper_background": "#FFFFFF"
   },
   "moveMenuLeft": true,
   "nav_menu": {
    "height": "105px",
    "width": "252px"
   },
   "navigate_menu": true,
   "number_sections": true,
   "sideBar": true,
   "threshold": 4,
   "toc_cell": false,
   "toc_position": {
    "height": "656.3px",
    "left": "0px",
    "right": "1228px",
    "top": "116.267px",
    "width": "212px"
   },
   "toc_section_display": "block",
   "toc_window_display": false,
   "widenNotebook": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
